The purpose of this project is to create a machine learning model that can predict the price of a diamond based on its characteristics. This file performs an initial exploratory analysis of the diamonds data set, which is provided in the ggplot2 package. The analysis attempts to answer the following questions:
1. Best of the Best: Imagine someone wants to buy the best possible diamond, and money is no object. They only want to consider diamonds in the top categories of cut (Ideal), color (D), and clarity (IF). They want the most ideal range for depth (59-63) and table (54-57). Within the dataset, if we plot carat versus price, can we fit a clean trendline? Is it linear? Exponential? What’s the price of the largest-carat diamond, and is it the most expensive?
2. Depth and Table Percentages: I found the ideal depth and table values mentioned above online, but let’s explore the dataset a little. If we fix the 4 C’s (carat, cut, color, and clarity), how much do depth and table impact price? If we widen the ranges slightly, can we save a substantial amount?
3. Best Bang for the Buck: Imagine someone wants to find the diamond which maximizes cut, color, and clarity per dollar. Using the expanded depth and table values from question 2 above, when does price start to increase exponentially for cut? What about for color? And clarity?
4. Bigger is Better: Imagine a guy named Bob who wants to buy a pair of diamonds for his wife, and have them made into earrings for her birthday. In Bob’s mind, size (carat) is all that matters. He has $3200. He needs two diamonds with the exact same cut, color, and clarity (with very comparable depth and table values), and he wants them to be as big as possible. What size carat can he afford? If he adjusts his budget, how much does the “maximum carat” size shift? Can we plot that and fit a line to it to find the “knee in the curve”?
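The code in this file refers to an object named data and uses functions from dplyr, ggplot2, and plotly, but the setup chunk is not shown. A minimal sketch of a plausible setup follows, assuming data is simply the diamonds tibble from ggplot2 (the row and column counts reported further down are consistent with that):

```r
# Assumed setup; the original setup chunk is not shown in this file.
library(dplyr)    # filter(), mutate(), between(), glimpse(), %>%
library(ggplot2)  # ggplot(), and the diamonds data set itself
library(plotly)   # plot_ly(), layout() for the 3D scatter plots

# Assumption: data is an unmodified copy of ggplot2::diamonds.
data <- ggplot2::diamonds
```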
A sample, summary, and first glimpse of the diamonds data set is provided below:
head(data)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
summary(data)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 J: 2808 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 I: 5422 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 H: 8304 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 F: 9542 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 E: 9797 VVS1 : 3655 Max. :79.00
## D: 6775 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
glimpse(data)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...
The data set has 53,940 rows, one per diamond (observation), and 10 columns, one per variable (feature). The features are:
names(data)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
price: price in US dollars (326 to 18823)
carat: weight of the diamond (0.2 to 5.01)
cut: quality of the cut (in order of worst to best; Fair, Good, Very Good, Premium, Ideal)
color: diamond color (in order of worst to best; J, I, H, G, F, E, D)
clarity: a measurement of how clear the diamond is (in order of worst to best; I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)
x: length in mm (0 to 10.74)
y: width in mm (0 to 58.9)
z: depth in mm (0 to 31.8)
depth: total depth percentage = 100 * 2 * z / (x + y), i.e. z as a percentage of the mean of x and y (43 to 79)
table: width of the top of a diamond relative to its widest point (43 to 95)
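The depth definition and the zero minimums for x, y, and z can both be checked directly. The sketch below, which assumes data is the unmodified ggplot2 diamonds tibble, counts rows with a zero dimension (physically impossible, so presumably missing measurements) and recomputes depth as a percentage, 100 * 2 * z / (x + y), for comparison with the recorded column. The names zero_dims, depth_check, depth_calc, and depth_diff are illustrative.

```r
library(dplyr)

data <- ggplot2::diamonds  # assumption: same data as in this file

# Rows where at least one physical dimension is recorded as 0 mm
zero_dims <- data %>% filter(x == 0 | y == 0 | z == 0)
nrow(zero_dims)

# Recompute depth (in percent) from x, y, z and compare with the depth column
depth_check <- data %>%
  filter(x > 0, y > 0, z > 0) %>%
  mutate(
    depth_calc = 100 * 2 * z / (x + y),
    depth_diff = abs(depth_calc - depth)
  )
summary(depth_check$depth_diff)
```

Small differences are expected from rounding in the recorded values. Large ones may point to data-entry problems, such as the maximum y of 58.9 mm.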
First we need to filter the data set to only the best cut, color, clarity, depth, and table.
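# Keep only top-grade diamonds within the ideal depth and table ranges
df <- data %>%
  filter(
    cut == "Ideal",
    color == "D",
    clarity == "IF",
    between(depth, 59, 63),  # between() is inclusive at both ends
    between(table, 54, 57)
  )
print(df)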
## # A tibble: 24 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.51 Ideal D IF 62 56 3446 5.14 5.18 3.2
## 2 0.51 Ideal D IF 62.1 55 3446 5.12 5.13 3.19
## 3 0.53 Ideal D IF 61.5 54 3517 5.27 5.21 3.22
## 4 0.53 Ideal D IF 62.2 55 3812 5.17 5.19 3.22
## 5 0.59 Ideal D IF 60.9 57 4208 5.4 5.43 3.3
## 6 0.56 Ideal D IF 62.4 56 4216 5.24 5.28 3.28
## 7 0.56 Ideal D IF 61.9 57 4293 5.28 5.31 3.28
## 8 0.63 Ideal D IF 62.5 55 6549 5.47 5.5 3.43
## 9 0.63 Ideal D IF 62.5 55 6607 5.5 5.47 3.43
## 10 1.04 Ideal D IF 61.8 57 14494 6.49 6.52 4.02
## # ... with 14 more rows
The filter leaves only 24 diamonds. Let’s plot carat versus price.
ggplot(df, aes(x=carat, y=price)) +
  geom_point()
Looks like a linear model could be viable. Let’s create one.
model_df_lm <- lm(price ~ carat, df)
summary(model_df_lm)
##
## Call:
## lm(formula = price ~ carat, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2123.5 -1264.7 245.6 1026.5 2344.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5621.9 623.1 -9.023 7.57e-09 ***
## carat 20259.9 911.8 22.221 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1288 on 22 degrees of freedom
## Multiple R-squared: 0.9573, Adjusted R-squared: 0.9554
## F-statistic: 493.8 on 1 and 22 DF, p-value: < 2.2e-16
The carat coefficient is highly significant (t = 22.2, p < 2e-16), and the model explains about 96% of the variance in price (R-squared = 0.957). Let’s plot this model now.
# Add the fitted prices to df and overlay them on the scatter plot
pred_df_lm <- predict(model_df_lm, df)
df <- df %>%
  mutate(pred = pred_df_lm)
ggplot(df, aes(x=carat, y=price)) +
  geom_point(color="blue") +
  geom_line(color="red", aes(y=pred))
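Question 1 also asks whether the trend could be exponential. One way to probe that, sketched here rather than taken from the original analysis, is to fit log(price) against carat and compare the errors of both models on the dollar scale. The names model_df_log, rmse, rmse_lm, and rmse_log are introduced for illustration, and df is rebuilt from ggplot2::diamonds so the block runs on its own. Note that the linear fit above has an intercept of -5621.9, so it predicts a negative price below roughly 0.28 carat, which is one reason to consider the alternative.

```r
library(dplyr)

# Rebuild the filtered subset used above (assumes ggplot2::diamonds is the source)
df <- ggplot2::diamonds %>%
  filter(cut == "Ideal", color == "D", clarity == "IF",
         between(depth, 59, 63), between(table, 54, 57))

model_df_lm  <- lm(price ~ carat, data = df)       # linear: price = a + b * carat
model_df_log <- lm(log(price) ~ carat, data = df)  # exponential: price = exp(a + b * carat)

# Root mean squared error in dollars, so the two fits are directly comparable
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_lm  <- rmse(df$price, fitted(model_df_lm))
rmse_log <- rmse(df$price, exp(fitted(model_df_log)))  # back-transform to dollars

c(linear = rmse_lm, exponential = rmse_log)
```

With only 24 points, neither comparison is conclusive; a residual plot for each model would be a sensible next check.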
Looks like the most expensive diamond might not be the largest one. Let’s find these two data points.
# Keep the largest-carat diamond and the most expensive diamond
df2 <- df %>%
  filter(
    carat == max(carat) |
      price == max(price)
  )
print(df2)
## # A tibble: 2 x 11
## carat cut color clarity depth table price x y z pred
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1.07 Ideal D IF 60.9 54 17042 6.66 6.73 4.08 16056.
## 2 1.03 Ideal D IF 62 56 17590 6.55 6.44 4.03 15246.
Interesting! The most expensive diamond is not the largest one, but it has a larger depth and table (62 and 56) than the largest-carat diamond (60.9 and 54). This needs a closer look.
First let’s create a scatter plot of price vs carat.
ggplot(data, aes(x=carat, y=price)) +
  geom_point()
Next, let’s create a scatter plot of price vs depth.
ggplot(data, aes(x=depth, y=price)) +
  geom_point()
Now let’s create a scatter plot of price vs table.
ggplot(data, aes(x=table, y=price)) +
  geom_point()
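A numeric summary can complement the three scatter plots. This sketch computes pairwise Pearson correlations between price, carat, depth, and table on the full data set; the name cors is illustrative, and data is assumed to be the unmodified ggplot2 diamonds tibble. Correlation only captures linear association, so it should be read alongside the plots rather than instead of them.

```r
data <- ggplot2::diamonds  # assumption: same data as in this file

# Pearson correlation matrix for the numeric variables plotted above
cors <- cor(data[, c("price", "carat", "depth", "table")])
round(cors, 3)
```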
Next, 3D scatter plots of price vs carat vs depth: first without grouping, then colored by cut, color, and clarity.
# No grouping
plot_ly(data=data, x=~price, y=~carat, z=~depth, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))
# Colored by cut
plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~cut, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))
# Colored by color
plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~color, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))
# Colored by clarity
plot_ly(data=data, x=~price, y=~carat, z=~depth, color=~clarity, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Depth")))
Finally, the same four 3D scatter plots of price vs carat vs table.
# No grouping
plot_ly(data=data, x=~price, y=~carat, z=~table, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))
# Colored by cut
plot_ly(data=data, x=~price, y=~carat, z=~table, color=~cut, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))
# Colored by color
plot_ly(data=data, x=~price, y=~carat, z=~table, color=~color, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))
# Colored by clarity
plot_ly(data=data, x=~price, y=~carat, z=~table, color=~clarity, type = "scatter3d", marker = list(size=1)) %>%
  layout(scene= list(xaxis=list(title="Price"), yaxis=list(title="Carat"), zaxis=list(title="Table")))
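The plots show overall patterns, but the second question asks what happens when the 4 C’s are held fixed. One possible approach, offered as a sketch and not as the original author's method, is to group diamonds by exact carat, cut, color, and clarity and compare median prices inside the ideal depth and table ranges with those just outside them. The widened ranges used here (depth 58-64, table 53-59) are hypothetical values picked only for illustration, as are the names range_compare, band, and pct_diff.

```r
library(dplyr)
library(tidyr)

data <- ggplot2::diamonds  # assumption: same data as in this file

range_compare <- data %>%
  mutate(band = case_when(
    between(depth, 59, 63) & between(table, 54, 57) ~ "ideal",
    between(depth, 58, 64) & between(table, 53, 59) ~ "widened_only",  # hypothetical ranges
    TRUE ~ "outside"
  )) %>%
  filter(band != "outside") %>%
  # Hold the 4 C's fixed by grouping on them
  group_by(carat, cut, color, clarity, band) %>%
  summarise(median_price = median(as.numeric(price)), n = n(), .groups = "drop") %>%
  # One row per 4 C's combination, with a column per band
  pivot_wider(names_from = band, values_from = c(median_price, n)) %>%
  filter(!is.na(median_price_ideal), !is.na(median_price_widened_only)) %>%
  mutate(pct_diff = 100 * (median_price_widened_only - median_price_ideal) /
           median_price_ideal)

# Negative values mean the widened-only diamonds were cheaper
summary(range_compare$pct_diff)
```

Exact-carat matching keeps the comparison strict but leaves many groups with a single diamond; binning carat would trade precision for larger groups.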
The analyses for Best Bang for the Buck and Bigger is Better are under development.